Artificial Intelligence in Medicine
○ Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match Artificial Intelligence in Medicine's content profile, based on 15 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Cai, L.; Zhang, T.; Beets-Tan, R.; Brunekreef, J.; Teuwen, J.
The use of Electronic Health Records (EHRs) has increased significantly in recent years. However, a substantial portion of clinical data remains in unstructured text formats, especially in radiology. This limits the application of EHRs for automated analysis in oncology research. Pretrained language models have been utilized to extract feature embeddings from these reports for downstream clinical applications, such as treatment response and survival prediction. However, a thorough investigation into which pretrained models produce the most effective features for rectal cancer survival prediction has not yet been conducted. This study explores the performance of five Dutch pretrained language models, two publicly available (RobBERT and MedRoBERTa.nl) and three developed in-house for this study (RecRoBERT, BRecRoBERT, and BRec2RoBERT), each trained on distinct Dutch-only corpora, in predicting overall survival and disease-free survival outcomes in rectal cancer patients. Our results showed that our in-house BRecRoBERT, a RoBERTa-based language model trained from scratch on a combination of Dutch breast and rectal cancer corpora, delivered the best predictive performance for both survival tasks, achieving a C-index of 0.65 (0.57, 0.73) for overall survival and 0.71 (0.64, 0.78) for disease-free survival. It outperformed models trained on general Dutch corpora (RobBERT) or Dutch hospital clinical notes (MedRoBERTa.nl). BRecRoBERT demonstrated the potential to predict survival in rectal cancer patients using Dutch radiology reports at diagnosis. This study highlights the value of pretrained language models that incorporate domain-specific knowledge for downstream clinical applications. Furthermore, it shows that utilizing data from related domains can improve the quality of feature embeddings for certain clinical tasks, particularly where domain-specific data is scarce.
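Conceptually, the report-embedding step and the C-index evaluation look like the minimal sketch below, assuming a public Hugging Face RobBERT checkpoint and the lifelines package; the model id, the ridge stand-in for a survival model, and all data are illustrative placeholders, not the authors' pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge
from lifelines.utils import concordance_index

MODEL = "pdelobelle/robbert-v2-dutch-base"  # assumed public RobBERT checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL)

def embed(report: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over the report's tokens."""
    batch = tok(report, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# toy stand-ins for radiology reports and follow-up data
reports = ["Tumor in het rectum, circa 4 cm.", "Geen aanwijzingen voor recidief."]
months = [14.0, 60.0]  # observed follow-up in months
event = [1, 0]         # 1 = death observed, 0 = censored

X = torch.stack([embed(r) for r in reports]).numpy()
pred = Ridge().fit(X, months).predict(X)  # toy stand-in for a Cox model
print("C-index:", concordance_index(months, pred, event))
```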
Khashei, I.; Presciani, D.; Martinelli, L. P.; Grosjean, S.
Retrieval-augmented generation (RAG) is increasingly adopted to ground clinical conversational agents in external knowledge sources, yet many deployed prototypes lack the observability required for standard RAG evaluation. In particular, retrieved documents and grounding context are often not logged, preventing direct assessment of retrieval quality and faithfulness. We report a post-hoc evaluation of EMSy, a clinical RAG-based chatbot prototype, based on 2,660 multi-turn conversations collected between January and September 2025. Rather than benchmarking performance, we adopt an evaluation strategy based exclusively on observable signals. The analysis combines an exploratory intent analysis conducted on a random subset of heterogeneous interactions, automated quality scores available at the message and conversation level, and explicit user feedback, with 96.0% of rated conversations receiving positive feedback. Results indicate that message-level minimum scores capture localized low-quality responses that are not reflected by average conversation-level metrics, while user feedback reflects aggregate interaction impressions. This case study illustrates how diagnostic insights can be obtained under limited observability and identifies implications for the design and evaluation of future clinical RAG systems.
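The core diagnostic observation, that a per-conversation minimum over message-level scores flags localized failures which the conversation mean hides, can be illustrated with a minimal sketch; the scores and threshold below are invented, not EMSy's actual metrics.

```python
# Conversation-level mean vs message-level minimum on quality scores.
conversations = {
    "conv-1": [0.92, 0.88, 0.95],
    "conv-2": [0.90, 0.35, 0.93],  # one localized low-quality response
}
THRESHOLD = 0.5  # illustrative cutoff for "low quality"

for cid, scores in conversations.items():
    mean_s, min_s = sum(scores) / len(scores), min(scores)
    flagged = min_s < THRESHOLD  # caught by the min, masked by the mean
    print(f"{cid}: mean={mean_s:.2f} min={min_s:.2f} flagged={flagged}")
```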
Romagnoli, F.; Pellegrini, M.
Background: The ideal of personalized medicine is to support the clinical decision process towards the right drug for the right patient at the right time, using, among other diagnostic tools, molecular biomarkers that depend specifically on the patient's status and on the therapeutic options. Several challenges must be overcome to realize this vision. Patients present a wide spectrum of genetic variability even before developing diseases, and diseases like cancer add an extra layer of mutations, while only a very small fraction of such variants have diagnostic or prognostic value. Moreover, it is challenging to predict how a patient will respond to a specific drug based on the patient's omic profiling, since any drug introduces further perturbations in the biochemical model. Methods: In this paper we propose Personalized-DrugRank, a method for joint prediction of therapy response and time-to-response for cancer patients undergoing pharmacological therapy after surgery. The method personalizes the DrugMerge methodology for drug repositioning in order to extract a few synthetic indices useful as input to ML prediction tools. In particular, the proposed methodology is a novel and principled approach to merging independent patient-specific transcriptomic data with drug perturbation data from cell line assays. A key novel feature of our approach over the state of the art is the joint prediction of the patient's response to therapy along with an estimate of the time-to-response (i.e., the time needed for the therapy to succeed or fail). Findings: We tested our methodology on data from The Cancer Genome Atlas (TCGA) Program for three cancer types (breast, stomach, and colorectal cancer), 10 pharmacological regimens, and 13 homogeneous cohorts. For the therapy-response prediction task we developed models that attain an average AUC of 0.749, average p-value of 0.030, and average accuracy of 0.809 with balanced positive and negative predictive values. For the time-to-event prediction task we developed regression models for the 13 homogeneous cohorts that attain an average (geometric) concordance index of 0.782 (max 0.904, min 0.651) with an average log-likelihood p-value of 0.004, improving in nine of the 13 cohorts upon models based only on clinical parameters, which have an average concordance index of 0.678 and average p-value of 0.006. Interestingly, we attain statistically significant results even with quite small therapy-homogeneous cohorts (ranging from 7 to 32 patients). Conclusions: The ability to predict with high accuracy the response of a cancer patient to a chosen pharmacological therapeutic regimen, along with an estimate of the time-to-response, helps adapt the clinical decision process to the specific patient profile, thus increasing the likelihood of correct and timely therapeutic decisions.
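For cohorts this small (7-32 patients), an empirical significance test for the AUC is a natural choice; below is a minimal sketch of a label-permutation p-value on synthetic data, which may differ from the paper's exact testing procedure.

```python
# Permutation p-value for an AUC on a tiny cohort (synthetic data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 1, 1, 1, 1])                 # tiny therapy-homogeneous cohort
scores = np.array([.2, .4, .35, .6, .7, .55, .8])   # model's synthetic indices

observed = roc_auc_score(y, scores)
# Null distribution: AUCs after shuffling the response labels.
perm = [roc_auc_score(rng.permutation(y), scores) for _ in range(10_000)]
p_value = (1 + sum(a >= observed for a in perm)) / (1 + len(perm))
print(f"AUC={observed:.3f}, permutation p={p_value:.4f}")
```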
Dharmavaram, S.; Bhanushali, P.
Overcrowding of emergency departments (EDs) has become a global health care concern as patient volumes rise. Triage systems have been established for a considerable period, but their reliability in identifying the appropriate patients and level of service has come under much scrutiny. In this paper, we describe a comprehensive machine learning framework aimed at predicting critical emergency department outcomes and enabling dynamic routing decisions. Using the MIMIC-IV-ED database, which comprises more than 440,000 emergency visits, we design and assess varied predictive models, including classical clinical scores, interpretable ML systems, classical algorithms, and deep learning architectures. We investigate three significant outcomes: hospitalization after the ED visit, critical deterioration (ICU transfer or death within 12 hours), and 72-hour ED re-attendance. The results indicate that gradient boosting algorithms make better predictions, with AUROCs of 0.820, 0.881, and 0.699, than standard clinical scoring systems and complex deep learning models, while the interpretable AutoScore framework combines clinical performance with transparency. We also study patterns of feature importance across prediction tasks and discuss how these models can be implemented in real-time clinical workflows. This study builds a reproducible benchmarking platform for ED prediction research and presents evidence-based recommendations for intelligent patient routing systems that can enhance emergency care efficiency and resource utilization while improving patient outcomes in a high-pressure environment.
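A minimal sketch of one such benchmarked task, predicting hospitalization with a gradient-boosting classifier and reporting AUROC, follows; the features are synthetic stand-ins for MIMIC-IV-ED triage variables, not the study's pipeline.

```python
# Gradient boosting on a synthetic hospitalization-prediction task.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 12))  # stand-ins for vitals, age, acuity, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = HistGradientBoostingClassifier().fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```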
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient. Methods: We utilized 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE based on primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh) with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss' κ across replicates, performance was summarized with F1 (per run and majority vote), and invalid-format outputs were recorded. Results: Gemini showed near-perfect agreement across settings (κ = 0.942-1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0-1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984-0.995) with no invalid outputs. Performance remained stable (mean/majority-vote F1 = 0.955-0.971), and majority voting offered only marginal gains. Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that one run is often adequate while minimal replication provides a practical stability check.
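Agreement across replicate runs can be computed as in the sketch below, which treats the three replicates as raters and applies Fleiss' κ via statsmodels; the label matrix is invented for illustration.

```python
# Fleiss' kappa across three replicate runs of the same setting.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = abstracts, columns = replicate runs; 1 = POSITIVE, 0 = NEGATIVE
runs = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],  # one disagreeing replicate
    [0, 0, 0],
])
counts, _ = aggregate_raters(runs)  # subjects-by-categories count table
print("Fleiss kappa:", fleiss_kappa(counts))
```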
Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.
Introduction: Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller, locally deployed models using a disease-site-specific pipeline and prompt configuration that are optimized and reusable. Materials/Methods: We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a mid-size language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; 104 patients, 42 features; primary development), recurrent high-grade glioma (RiCi; 191 patients, 19 features; cross-lingual validation in German), and MIMIC-IV (100 patients, 10 features; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features. Results: The pipeline achieved mean F1 scores of 0.80 ± 0.07 (TNBC; 44 patients, 42 features), 0.79 ± 0.12 (RiCi; 61 patients, 19 features), and 0.84 ± 0.06 (MIMIC-IV; 100 patients, 10 features) on test sets under the automatic configuration. Compared to direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1-score by 0.19-0.22 and 0.17-0.19, respectively. Manual configuration refinement further improved the F1-score to 0.83 (TNBC) and 0.81 (RiCi), with no change for MIMIC-IV. Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, with a decrease in F1-score of 0.03-0.10. For TNBC, extraction time was reduced from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a C-index comparable to those with manual curation (0.77 vs 0.76; 12 events). Conclusions: OncoRAG, deployed locally using a mid-size language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.
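Phase (3), graph-diffusion reranking, can be approximated with personalized PageRank over the note-derived graph, as in the hedged sketch below; the graph contents, node names, and seeding scheme are illustrative guesses, not OncoRAG's actual implementation.

```python
# Personalized PageRank as a stand-in for graph-diffusion reranking.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("ER status", "chunk_12"), ("ER status", "immunohistochemistry"),
    ("immunohistochemistry", "chunk_31"), ("tumor grade", "chunk_31"),
])
# Seed the diffusion at the feature's ontology-enriched entity nodes.
seeds = {"ER status": 1.0}
scores = nx.pagerank(G, personalization=seeds)
chunks = sorted((n for n in G if n.startswith("chunk_")),
                key=scores.get, reverse=True)
print(chunks)  # retrieval order for the extraction prompt
```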
Mazzucato, S.; Seinen, T. M.; Moccia, S.; Micera, S.; Bandini, A.; van Mulligen, E. M.
Objective: Named Entity Recognition (NER) and Biomedical Entity Linking (BEL) are essential for transforming unstructured Electronic Health Records (EHRs) into structured information. However, tools for these tasks are limited in non-English biomedical texts such as Dutch and Italian. This study investigates the use of prompt-based learning with Large Language Models (LLMs) to perform multilingual NER and BEL using minimal domain-specific data, while addressing annotation preservation during corpus translation. Methods: An English-annotated corpus from the ShARe/CLEF dataset was translated into Dutch and Italian using a strategy that embeds annotations directly into the text prior to translation and retrieves them afterwards. GPT-4o was applied in zero-shot and few-shot settings to extract biomedical entities, which were then mapped to Unified Medical Language System Concept Unique Identifiers using contextual word embeddings. Performance was evaluated with precision, recall, and F1-score, and compared with gold-standard clinician annotations. Results: The multilingual NER pipeline achieved strong performance, with an overall F1-score of 0.98 across languages. BEL experiments showed reliable entity normalization, with an overall accuracy of 0.91 and a mean reciprocal rank of 0.95. The combined NER and BEL pipeline achieved 0.90, supporting the utility of LLMs in standardizing biomedical concepts across languages. Conclusion: Prompt-based LLMs can effectively perform NER and BEL in languages with fewer annotated resources, even with limited annotated training data. The proposed annotation-preserving translation method, combined with generative and discriminative LLM capabilities, provides a scalable approach to multilingual clinical information extraction. These findings highlight the potential for broader adoption of LLM-based natural language processing systems to support multilingual healthcare data harmonization.
Highlights:
- This study shows the feasibility of using prompt-based learning with large language models (LLMs) to perform multilingual named entity recognition (NER) and biomedical entity linking (BEL) in Dutch and Italian, two languages with fewer annotated resources.
- An annotation-preserving translation strategy was proposed to adapt the ShARe/CLEF eHealth corpus, enabling consistent evaluation across English, Dutch, and Italian without loss of gold-standard annotations.
- The multilingual NER pipeline achieved strong overall performance (F1-score: 0.89), while BEL experiments showed reliable entity normalization (F1-score: 0.64; MRR: 0.68) to standardized clinical concepts.
- The approach highlights the potential of generative and discriminative LLM capabilities for scalable multilingual clinical information extraction, supporting broader European initiatives for cross-lingual health data harmonization.
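The annotation-preserving translation strategy can be sketched as marker injection before translation and regex recovery afterwards; the marker syntax and the translation step below are placeholders, not the paper's exact method.

```python
# Annotation-preserving translation: protect spans, translate, recover.
import re

def protect(text: str, spans: list[tuple[int, int]]) -> str:
    """Embed [[E]]...[[/E]] markers around annotated character spans."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[[E]]" + text[start:end] + "[[/E]]" + text[end:]
    return text

def recover(translated: str) -> tuple[str, list[str]]:
    """Strip markers and return clean text plus translated entities."""
    entities = re.findall(r"\[\[E\]\](.*?)\[\[/E\]\]", translated)
    clean = re.sub(r"\[\[/?E\]\]", "", translated)
    return clean, entities

marked = protect("Patient reports chest pain.", [(16, 26)])
# A machine-translation call would go here; we simulate it with replace().
print(recover(marked.replace("chest pain", "dolore toracico")))
```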
Xu, S.; Wang, Z.; Wang, H.; Ding, Z.; Zou, Y.; Cao, Y.
Online cancer peer-support communities generate large volumes of patient-authored and caregiver-authored text that may reflect distress, coping, and informational needs. Automated emotional tone classification could support scalable monitoring, but supervised modeling depends on label quality and may benefit from explicit context features. Using the Mental Health Insights: Vulnerable Cancer Survivors & Caregivers dataset, we compared five model families (TF-IDF logistic regression, random forest, LightGBM, GRU, and fine-tuned ALBERT) on a three-class target (Negative/Neutral/Positive) derived from four original categories. We introduced two extensions: (i) LLM-based annotation to generate parallel "AI labels" and (ii) token-based augmentation that prepends LLM-extracted structured variables (reporter role and cancer type) to the post text. Models were trained with a 60/20/20 stratified train/validation/test split, with hyperparameters selected on validation data only. Test performance was summarized using weighted F1 and macro one-vs-rest AUC with bootstrap confidence intervals, with paired comparisons based on McNemar tests and false discovery rate adjustment. The LLM annotator produced substantial redistribution in the four-class label space, shifting prevalence toward very negative relative to the original labels; the shift persisted but attenuated after collapsing to three classes. Across all model families, token augmentation improved held-out performance, with the largest gains for GRU and consistent improvements for ALBERT. Augmentation also reduced polarity-reversing errors (Negative ↔ Positive) for ALBERT, while adjacent errors (Negative ↔ Neutral) remained the dominant residual failure mode. These results indicate that LLM-based supervision can introduce systematic measurement shifts that require auditing, yet LLM-extracted context incorporated via simple token augmentation provides a pragmatic, model-agnostic mechanism to improve downstream emotional tone classification for supportive oncology decision support. Author summary: We studied how to better monitor emotional tone in posts from online cancer peer-support communities, where patients and caregivers share experiences that may signal distress, coping, or unmet needs. Automated classification could help organizations and moderators identify when additional support may be needed, but these systems depend on the quality of the labels used for training and may miss clinical context. Using a public dataset of cancer survivor and caregiver posts, we trained and compared several machine-learning and deep-learning models to classify each post as negative, neutral, or positive. We tested two practical improvements. First, we used a large language model to generate an additional set of "AI labels" and examined how these differed from the original categories. Second, we extracted simple context information, whether the writer was a patient or caregiver and what cancer type was mentioned, and added this context to the text before model training. We found that adding context consistently improved performance across model types. However, the AI-generated labels shifted class distributions, indicating that automated labeling can introduce systematic changes that should be audited. Overall, simple context extraction can make emotional tone monitoring more accurate and useful for supportive oncology decision support.
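The token-based augmentation is a simple, model-agnostic string transform; a minimal sketch with an invented pseudo-token format follows.

```python
# Prepend LLM-extracted context variables as pseudo-tokens.
def augment(post: str, role: str, cancer_type: str) -> str:
    return f"[ROLE={role}] [CANCER={cancer_type}] {post}"

example = augment(
    "Scans came back clear today, feeling hopeful for the first time.",
    role="caregiver",
    cancer_type="breast",
)
print(example)
# The augmented string is then fed unchanged to any downstream
# classifier (TF-IDF LR, GRU, ALBERT, ...), which is what makes the
# mechanism model-agnostic.
```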
Liu, X.; Garg, M.; Jeon, E.; Jia, H.; Sauver, J. S.; Pagali, S. R.; Sohn, S.
Clinical narrative text contains crucial patient information, yet reliable extraction remains challenging due to linguistic variability, documentation habits, and differences across care settings. Large language models (LLMs) have shown strong accuracy on clinical information extraction (IE), but their reproducibility (stability under repeated runs) and robustness (stability under small, natural prompt variations) are less consistently quantified, despite being central to clinical deployment. In this study, we evaluate three open-weight LLMs representing distinct modeling choices: a dense general-purpose model (Llama 3.3), a mixture-of-experts (MoE) general-purpose model (Llama 4), and a domain-tuned medical model (MedGemma). We focus on binary clinical IE aligned with four mobility classes from the International Classification of Functioning, Disability and Health (ICF) framework. Using a controlled experimental design, we quantify (1) intra-prompt reproducibility across repeated sampling and (2) inter-prompt robustness across paraphrased prompts. We jointly report predictive performance (F1-score) and stability (Fleiss' kappa, κ), and we test factor effects using three-way ANOVA with post-hoc comparisons. Results show that increasing temperature generally degrades agreement, but the magnitude depends on model and task; furthermore, prompt paraphrasing can substantially reduce stability, with particularly large drops for the MoE model. Finally, we evaluate a practical mitigation, self-consistency via majority voting, which improves κ substantially and often improves or preserves F1-score, at the cost of additional inference. Together, these findings provide a reproducible framework and concrete recommendations for evaluating and improving LLM reliability in clinical IE.
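The evaluated mitigation, self-consistency via majority voting, reduces to taking the mode over k sampled extractions; a minimal sketch with a placeholder for the LLM call follows.

```python
# Self-consistency: majority vote over k sampled extractions.
import random
from collections import Counter

def sample_label(note: str) -> str:
    # Placeholder for one LLM extraction call with sampling enabled.
    return random.choice(["mobility", "mobility", "no_mobility"])

def majority_vote(note: str, k: int = 5) -> str:
    votes = Counter(sample_label(note) for _ in range(k))
    return votes.most_common(1)[0][0]

print(majority_vote("Pt ambulates 50 ft with rolling walker."))
```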
Weyrich, J.; Dennstaedt, F.; Foerster, R.; Schroeder, C.; Aebersold, D. M.; Zwahlen, D. R.; Windisch, P.
Purpose: Large language models (LLMs) offer significant potential for automating the classification of clinical trials by eligibility criteria. However, a critical question remains regarding the optimal input data: while abstracts provide a condensed, high-density signal, full-text articles contain a much higher volume of information. It remains unclear whether the additional signal found in full texts improves classification performance or whether the accompanying noise (thousands of words irrelevant to the question at hand in a complete manuscript) negatively affects the model's reasoning capabilities. Methods: GPT-5 was applied to classify 200 randomized controlled oncology trials from high-impact medical journals, labeling whether patients with localized and/or metastatic disease were eligible for inclusion. Each trial was classified twice, once using only the abstract and once using the full text, and GPT-5's outputs were compared with the ground-truth labels established by manual annotation. Performance was assessed by calculating and comparing accuracy, precision, recall, and F1 score, and the McNemar test was used to assess the statistical significance of the differences between the two input formats. Results: For identifying trials including patients with localized disease, GPT-5 achieved an accuracy of 86% (95% CI: 81%-91%; F1 = 0.90) when using abstracts and 92% (95% CI: 88%-95%; F1 = 0.92) when using full texts (p = 0.027). Performance for detecting trials that include patients with metastatic disease was comparably high, with accuracies of 99% (95% CI: 99%-100%; F1 = 1.00) based on abstracts and 98% (95% CI: 97%-100%; F1 = 0.99) based on full texts. Overall accuracy for assigning combined labels per trial increased from 86% (95% CI: 81%-91%) using abstracts to 92% (95% CI: 88%-95%) using full texts (p = 0.027). Conclusion: Providing full-text articles to GPT-5 significantly improved the classification of trial eligibility criteria. These findings suggest that, for this task, the benefit of the additional signal contained within the full text outweighed the potential for performance degradation caused by increased noise. Full-text analysis appears particularly valuable for extracting specific eligibility criteria in oncology that are frequently omitted or not explicitly described within the abstract.
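The paired comparison can be reproduced with McNemar's test on the discordant cells of a 2x2 table of per-trial correctness, as in the sketch below; the counts are synthetic, not the study's.

```python
# McNemar's exact test on paired abstract vs full-text correctness.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#                 full-text correct   full-text wrong
table = np.array([[170,               2],    # abstract correct
                  [14,                14]])  # abstract wrong
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p={result.pvalue:.3f}")
```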
Yu, A.; Weile, J.; Courtot, M.
Motivation: Medical documents are a crucial resource for medical research around the world. While troves of valuable health data exist, they are largely computationally inaccessible as hard copies of unstructured text. Moreover, the persistent prevalence of fax machines in medical settings contributes to further degradation of document quality. Digitizing these resources through manual data extraction is time-consuming and resource-intensive. However, large language models (LLMs) have recently shown great promise for automated digitization and information extraction (IE), greatly improving upon previous tools in terms of speed and accuracy. Results: We reviewed recent LLM-based tools for named entity recognition (NER) and IE from the literature and assessed their suitability for use in a clinical setting. We found only two of these tools to be usable out of the box and compared them with LLM foundation models prompted to perform extractions. Using 1,000 mock medical documents with paired reference data, we evaluated the tools' performance in different scenarios, comparing zero-shot and one-shot prompts as well as unimodal and multimodal (image and text) inputs where possible. The most effective model was OpenAI's GPT-4.1-mini, with an average F1 score of 55.6. The best-performing local model was Google's Gemma 3 with 27B parameters, given image inputs and a zero-shot prompt, with an average F1 score of 41.3. We found the choice of prompting strategy to have minimal impact on extraction performance. We also assessed the effects of image distortions commonly introduced by fax machines and found a significant impact on extraction performance. Availability: Source code and data are available on GitHub at https://github.com/courtotlab/PDF_benchmarking. Supplementary information: Supplementary data are available at Journal Name online.
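The fax-distortion stress test can be simulated by downsampling and hard-binarizing a page image before extraction; the sketch below uses Pillow with illustrative parameters, not the benchmark's exact degradation model.

```python
# Simulate fax-like degradation: downsample, binarize, upsample.
from PIL import Image

def faxify(img: Image.Image, dpi_scale: float = 0.5, threshold: int = 160) -> Image.Image:
    w, h = img.size
    low = img.convert("L").resize((int(w * dpi_scale), int(h * dpi_scale)))
    binary = low.point(lambda p: 255 if p > threshold else 0, mode="1")
    return binary.resize((w, h))  # back to original size, detail lost

page = Image.new("L", (850, 1100), color=255)  # blank stand-in page
degraded = faxify(page)
degraded.save("degraded_page.png")  # input for the extraction pipeline
```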
Dai, H.-J.; Mir, T. H.; Fang, L.-C.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.
Accurate recognition and de-identification of sensitive health information (SHI) in spoken dialogues requires multimodal algorithms that can understand medical language and contextual nuance; errors in recognition and de-identification risk exposing SHI. Additionally, the variability and complexity of medical terminology, along with the inherent biases in medical datasets, further complicate this task. This study introduces the SREDH/AI-Cup 2025 Medical Speech Sensitive Information Recognition Challenge, which focuses on two tasks: Task 1, speech transcription, in which systems must accurately transcribe speech into text; and Task 2, medical speech de-identification, in which systems must detect and appropriately classify mentions of SHI. The competition attracted 246 teams; top-performing systems achieved a mixed error rate (MER) of 0.1147 and a macro F1-score of 0.7103, with an average MER and macro F1-score of 0.3539 and 0.2696, respectively. Results were presented at the IW-DMRN workshop in 2025. Notably, LLMs were prevalent across both tasks: 97.5% of teams adopted LLMs for Task 1 and 100% for Task 2, highlighting their growing role in healthcare. Furthermore, we fine-tuned six models, demonstrating strong precision (~0.885-0.889) with slightly lower recall (~0.830-0.847), resulting in F1-scores of 0.857-0.867.
Weissenbacher, D.; Shabbir, M.; Campbell, I. M.; Berdahl, C. T.; Gonzalez-Hernandez, G.
Background: Large language models (LLMs) contain limited professional medical knowledge, as large-scale training on clinical text has not yet been possible due to restricted access. Objectives: To continue pre-training an open-access instruct LLM on de-identified medical notes and evaluate the resulting impact on real-world clinical decision-making tasks and standard benchmarks. Methods: Using 500K de-identified clinical notes from Cedars-Sinai Health System, we fine-tuned a Qwen3-4B Instruct model with supervised learning to generate medical decision-making (MDM) paragraphs from patient presentations, and evaluated it on assigned-diagnosis prediction, in-hospital cardiac-arrest mention detection, and a suite of general and biomedical benchmarks. Results: The fine-tuned model produced MDMs that closely resembled those written by physicians and outperformed the base instruct model and larger clinically untrained models (Qwen3-32B and Llama-3.1-405B Instruct) on assigned-diagnosis prediction, the task most aligned with its training objective. On the task of detecting in-hospital cardiac-arrest mentions, the model initially exhibited mild label collapse, but a brief task-specific fine-tuning stage resolved this issue and allowed it to surpass all competitors. The model also retained general knowledge on biomedical and general-domain evaluation benchmarks relative to the baseline. Conclusion: Supervised full fine-tuning on clinical notes allowed the model to incorporate medical knowledge and transfer it to unseen biomedical tasks without wholesale loss of general-domain abilities, while revealing collapse-related failure modes that motivate more principled strategies for clinical specialization.
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Background. Adult diffuse glioma is a representative class of primary brain tumors for which accurate MRI-based tumor segmentation is indispensable for treatment planning. Conventional automated segmentation methods have relied primarily on image information and spatial prompts, and auxiliary clinical information that is routinely acquired in clinical practice has not been sufficiently exploited as an input. Objective. Building on a dual-prompt-driven Segment Anything Model (SAM) extension framework that fuses visual and language reference prompts, we propose a method that integrates patient demographics, unsupervised molecular cluster variables derived from TCGA high-throughput profiling, and histopathological parameters as learnable prompt embeddings, and we evaluate its effect on the accuracy of lower-grade glioma (LGG) MRI segmentation. Methods. An auxiliary prompt encoder converts clinical metadata into high-dimensional embeddings that are fused with the prompt representations of SAM ViT-B through a cross-attention fusion mechanism. The TCGA-LGG MRI Segmentation dataset (Kaggle release by Buda et al.; n = 110 patients; WHO grade II-III) was split at the patient level (train/val/test = 71/17/22) using three different random seeds, and the three slices with the largest tumor area were extracted from each patient. To avoid pseudo-replication arising from multiple slices per patient and repeated measurements across seeds, our primary analysis aggregated Dice and 95th-percentile Hausdorff distance (HD95) to the patient × seed unit (n = 66); secondary analyses at the unique-patient level (n = 22) and at the per-slice level (n = 198) are also reported. Pairwise comparisons used paired t-tests with Bonferroni correction (k = 3) and Wilcoxon signed-rank tests, and a permutation test (K = 30) served as an auxiliary check of effective use of the auxiliary information. Results. At the patient × seed level (n = 66), the proposed full-clinical configuration achieved a Dice gain of +0.287 over the zero-shot SAM ViT-B baseline (paired t-test p = 4.2 × 10^-15, Cohen's d_z = +1.25, Bonferroni-corrected p << 0.001; Wilcoxon p = 2.0 × 10^-10), and HD95 improved from 218.2 to 64.6. Because zero-shot SAM is not designed for domain-specific medical segmentation, the large absolute HD95 gap largely reflects the expected domain gap rather than a competitive baseline. The additional contribution of the full clinical configuration over the demographics-only configuration was a Dice difference of +0.023 (paired t-test p = 0.057, Bonferroni-corrected p = 0.172), which did not reach statistical significance at the patient level and is reported as a directional trend. The permutation test (K = 30, seed 2025) yielded real-metadata Dice = 0.819 versus a shuffled-metadata mean of 0.773, giving an empirical p = 0.032 = 1/(K + 1), which is at the resolution limit of this test and should therefore be interpreted as preliminary evidence. Conclusions. Integrating auxiliary clinical information as multimodal prompts produced a large improvement over the zero-shot SAM baseline on this LGG cohort. More importantly, a robustness analysis showed that the proposed full-clinical configuration outperformed the trained base model (no auxiliary information) under all tested spatial-prompt conditions, including a perfect centroid (+0.014), and that the advantage was most pronounced in the prompt-free regime (+0.231, p = 0.039), where the base model collapsed but the proposed model maintained meaningful segmentation by leveraging clinical metadata alone. The additional contribution of molecular and histopathological information beyond demographics was not statistically resolved at the patient level (+0.023, n.s.). Establishing clinical utility will require external validation on larger multi-center cohorts and direct comparisons with established segmentation methods. Keywords: brain tumor segmentation; Segment Anything Model (SAM); vision-language prompt-driven segmentation; auxiliary clinical prompts; multimodal learning; TCGA-LGG; deep learning
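The statistical recipe above (paired t-test and Wilcoxon signed-rank on per-patient Dice, Bonferroni-corrected for k = 3, plus Cohen's d_z) can be sketched as follows on synthetic Dice values.

```python
# Paired tests on per-patient Dice with Bonferroni correction (k = 3).
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(7)
dice_base = rng.uniform(0.45, 0.65, size=66)        # zero-shot baseline
dice_full = dice_base + rng.normal(0.28, 0.05, 66)  # full clinical prompts

t_stat, p_t = ttest_rel(dice_full, dice_base)
w_stat, p_w = wilcoxon(dice_full, dice_base)
k = 3  # number of pairwise comparisons in the study design
diff = dice_full - dice_base
print(f"paired t: p={min(p_t * k, 1.0):.2e} (Bonferroni), Wilcoxon p={p_w:.2e}")
print("Cohen's d_z:", diff.mean() / diff.std(ddof=1))
```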
Jonnalagadda, P.; Obeng-Gyasi, S.; Stover, D. G.; Andersen, B. L.; Rahurkar, S.
Background: Many patients with triple-negative breast cancer (TNBC), particularly those who are older, Black, or insured by Medicaid, do not receive guideline-concordant treatment, despite its association with up to 4x higher survival. Early identification of patients at risk for rapid relapse may enable timely interventions and improve outcomes. This study applies machine learning (ML) to real-world data to predict the risk of rapid relapse in TNBC. Methods: We trained various ML models (logistic regression, decision trees, random forests, XGBoost, naive Bayes, support vector machines) using National Cancer Database (NCDB) data and fine-tuned them using electronic health record (EHR) data from a cancer registry. Class imbalance was addressed using the synthetic minority oversampling technique (SMOTE). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the receiver operating characteristic curve (ROC AUC), accuracy, and F1 scores. Transfer learning, cross-validation, and threshold optimization were applied to enhance the ensemble model's performance on clinical data. Results: Initial models trained on NCDB data exhibited high NPV but low sensitivity and PPV. SMOTE and hyperparameter tuning produced modest improvements. External testing on EHR data from a cancer registry showed similar model performance. After applying transfer learning, cross-validation, and threshold optimization using the clinical data, the ensemble model achieved higher performance: a sensitivity of 0.87, specificity of 0.99, PPV of 0.90, NPV of 0.98, ROC AUC of 0.99, accuracy of 0.98, and F1-score of 0.88. This optimized model, leveraging readily available clinical data, demonstrated superior performance compared to the initial NCDB-trained models and those reported in the extant literature. Conclusions: Transfer learning and threshold optimization effectively adapted ML models trained on NCDB data to an independent real-world clinical dataset from a single site, producing a high-performing model for predicting rapid relapse in TNBC. This model, potentially translatable to Fast Healthcare Interoperability Resources (FHIR)-compatible workflows, represents a promising tool for identifying patients at high risk. Future work should include prospective external validation, evaluation of integration into clinical workflows, and implementation studies to determine whether the model improves care processes such as timely patient navigation and treatment planning. Author Summary: In this study, we set out to understand which patients with triple-negative breast cancer might experience a rapid return of their disease. Many people with this aggressive form of cancer do not receive the treatments that are known to improve survival, especially patients who are older, Black, or insured through public programs. Being able to identify those at highest risk early in their care could help health teams provide timely support and ensure that patients receive the treatments they need. To do this, we used information from a large national cancer database to build computer-based models that learn from patterns in patient data. We then refined these models using real medical records from a cancer center to make sure they worked well in everyday clinical settings. After adjusting and improving the models, we developed a tool that can correctly identify most patients who are likely to have a rapid return of their cancer. Our hope is that this type of tool could eventually be built into routine care and help guide timely follow-up, support services, and treatment planning. More testing in real clinical environments will be important to understand how well the tool improves care and outcomes for patients.
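Two of the steps named above, SMOTE applied to the training fold only and validation-set threshold optimization, look roughly like the sketch below; the data and the random-forest stand-in are synthetic illustrations, not the study's models.

```python
# SMOTE on the training fold, then F1-maximizing threshold selection.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = (rng.random(2000) < 0.08).astype(int)  # rare rapid-relapse label

X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # training fold only

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
probs = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_val, probs >= t))
print("best threshold:", best)
```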
Ray, P.
Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Current diagnostic methods depend on practitioners' personal judgment to evaluate imaging results and separate clinical tests, creating inconsistency that can lead to incorrect evaluations. Combining radiological imaging with clinical information systems enables healthcare providers to make more reliable predictions about patient outcomes and improves decision-making. This study introduces a deep learning framework that combines magnetic resonance imaging (MRI) data with clinical text to predict thyroid cancer. A Vision Transformer (ViT) extracts high-level MRI features, while a domain-adapted language model processes clinical documents containing patient medical history, symptoms, and laboratory results. A cross-modal attention mechanism merges the imaging and textual information, capturing how the two modalities are interconnected, and a classification layer on the fused features estimates the probability of malignancy. Experimental results show that the proposed multimodal system outperforms unimodal baselines, achieving higher accuracy, sensitivity, specificity, and AUC, which can help medical personnel make better preoperative decisions.
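The cross-modal fusion can be sketched with a single cross-attention block in which text tokens query ViT patch features; the dimensions and sigmoid head below are illustrative, not the paper's architecture.

```python
# Cross-modal attention: text queries attend over image patch features.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, text_feats, image_feats):
        # text tokens are queries; ViT patches are keys/values
        fused, _ = self.attn(text_feats, image_feats, image_feats)
        return torch.sigmoid(self.head(fused.mean(dim=1)))  # P(malignant)

model = CrossModalFusion()
text = torch.randn(2, 32, 256)    # batch, text tokens, dim
image = torch.randn(2, 196, 256)  # batch, ViT patches, dim
print(model(text, image).shape)   # torch.Size([2, 1])
```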
Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.
Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score (NPS). Non-parametric methods were used throughout. Results: Clinicians reported high perceived time savings (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: medical AI assistant, LLMs in healthcare, agentic AI, clinical decision support, point-of-care AI
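For reference, an NPS is computed from ratings on the standard 0-10 scale as the percentage of promoters (9-10) minus detractors (0-6); a minimal sketch with invented ratings follows.

```python
# Net Promoter Score from 0-10 ratings (ratings here are invented).
def nps(ratings: list[int]) -> float:
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)

print(nps([10, 9, 9, 10, 8, 9, 10, 7, 9, 10]))  # 80.0 for this sample
```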
Kamalov, F.; Thabtah, F.; Peebles, D.; Ibrahim, A.
Timely and accurate diagnosis of dementia remains a critical yet challenging task. Although machine learning (ML) techniques have shown considerable promise in dementia detection, their inherent complexity often results in opaque, "black-box" models that limit clinical acceptance and usability. In this paper, we propose the Selectively Augmented Decision Tree (SADT), an interpretable AI model specifically designed for dementia detection. SADT incorporates a structured three-phase pipeline consisting of feature selection, data balancing, and construction of a transparent decision tree classifier. We apply SADT to the OASIS dataset and show empirically that it outperforms traditional ML benchmarks, validating its effectiveness. In addition to its strong performance, SADT mirrors aspects of human decision-making in its sequential, rule-based prioritization of key features. This approach aligns with cognitive models of cue use and heuristic reasoning, making it not only clinically transparent but also psychologically aligned with how diagnostic decisions are often made in practice. SADT's strong predictive performance and interpretability, grounded in human reasoning, facilitate explanation and human scrutiny and have the potential to improve both clinical decision-making and trust in AI-assisted diagnosis.
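The three-phase shape described for SADT (feature selection, data balancing, transparent tree) can be sketched as below; the specific selector, oversampler, and synthetic data are illustrative choices, since the paper's exact components are not specified here.

```python
# Three-phase pipeline: select features, balance classes, fit a
# shallow tree whose rules can be printed for clinical scrutiny.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))  # stand-ins for OASIS variables (MMSE, CDR, ...)
y = (X[:, 0] + rng.normal(size=300) > 0.8).astype(int)

X_sel = SelectKBest(f_classif, k=4).fit_transform(X, y)      # phase 1
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_sel, y)  # phase 2

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_bal, y_bal)  # phase 3
print(export_text(tree))  # human-readable rule list
```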
Bian, R.; Cheng, W.
The rapid development of large language models (LLMs) has stimulated growing interest in their use for medical question answering and clinical decision support. However, compared with frontier proprietary systems, the empirical understanding of lightweight open-source LLMs in medical settings remains limited, particularly under resource-constrained experimental conditions. To address this gap, we introduce MedScope, a lightweight benchmarking framework for systematically evaluating open-source LLMs on medical multiple-choice question answering. Using 1,000 sampled questions from MedMCQA, we benchmark six lightweight open-source models spanning three representative model families: LLaMA, Qwen, and Gemma. Beyond standard predictive metrics such as accuracy and macro-F1, our framework additionally considers inference time, prediction consistency, subject-wise variability, and model-specific error patterns. We further develop a set of multi-perspective visual analyses, including clustered heatmaps, agreement matrices, Pareto-style trade-off plots, radar charts, and multi-panel summary figures, in order to characterize model behavior in a more interpretable and comprehensive manner. Our results reveal substantial heterogeneity across models in predictive performance, efficiency, and subject-level robustness. While larger lightweight models generally achieve better overall results, the gain is neither uniform across subject categories nor always aligned with efficiency. These findings suggest that lightweight open-source LLMs remain valuable as transparent and reproducible medical AI baselines, but their current capabilities are still insufficient for unsupervised deployment in high-risk healthcare scenarios. MedScope provides an accessible benchmark for evaluating lightweight medical LLMs and emphasizes the need for multi-dimensional assessment beyond accuracy alone. The relevant code is open-sourced at https://github.com/VhoCheng/MedScope.
Mazzucato, S.; Bandini, A.; Sartiano, D.; Vergaro, G.; Dalmiani, S.; Emdin, M.; Micera, S.; Oddo, C. M.; Passino, C.; Moccia, S.
Purpose: Natural Language Processing (NLP) has the potential to extract structured clinical knowledge from unstructured Electronic Health Records (EHRs). However, the limited availability of annotated datasets for algorithm training restricts its application in clinical practice. This study investigates the use of transformer-based NLP models to structure Italian EHRs in cardiac settings, addressing this gap. Methods: We implemented and evaluated three named entity recognition algorithms: SpaCy, Flair, and Multiconer. The experiments utilized three datasets comprising 2,235 anamneses from patients at the Fondazione Toscana Gabriele Monasterio, Italy. Results: The SpaCy model achieved the highest performance, with an F1-score of 97% in identifying clinical features on explicitly mentioned entities (presence/absence classification). However, features are not always mentioned, as clinicians selectively document only clinically relevant information in real-world practice. External validation shows model generalizability: 97.13% F1 on the EVD-100 dataset (12 features) and 88.29% F1 on the STEMI dataset (3 shared features). These structured variables were subsequently used to train machine learning algorithms (logistic regression, XGBoost, CatBoost) for classifying amyloidosis in heart failure patients. The classifiers trained on SpaCy-structured data attained an average F1-score of 66.70%, closely matching the 66.99% F1-score of classifiers using clinician-annotated data. Conclusion: This study shows the feasibility of using NLP to structure Italian EHRs in realistic clinical settings, highlighting its potential to enhance computer-assisted detection despite selective documentation patterns. The comparable performance across annotation methods suggests NLP's capability to bridge the gap in dataset annotation, paving the way for its integration into clinical practice.